OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

The integration of document image processing and text retrieval principles

Internal identifier: 003061 (Main/Exploration); previous: 003060; next: 003062

Authors: Nil Van Der Merwe [South Africa]

Source: The Electronic Library, vol. 11, no. 4/5 (1993), pp. 273-278

RBID : ISTEX:1B78DFC63722E09B83AD64D444F2D260FE360F24

Abstract

This paper will discuss the integration of document image processing and text retrieval principles in order to process and load existing paper documents automatically in an electronic document database that broadens the user's capability to retrieve relevant information more accurately, without going through costly processes to get paper documents into electronic text. The principles of document image processing systems, as well as the problems and shortcomings of most of today's document image processing systems, will be discussed. Then concept retrieval as the latest development in text retrieval will be discussed, with specific reference to the ability of the TOPIC intelligent text retrieval system to allow users to build up a knowledge base of search objects or concepts that can be used at any point in time by all users for the system. This paper will further specifically look at the automatic processing of paper documents by converting the scanned document image pages through to electronic text. The use of optical character recognition technology, the indexing and loading of the documents in a text database, the automatic linking of the documents to the related document images and the retrieval technology available in TOPIC, specifically the TYPO operator that was developed to handle so-called dirty data such as the common misspellings, character transpositions and dirty text received as output from the OCR process, will be discussed. A possible solution to load paper documents quickly and cost-effectively into an electronic document database will be discussed and demonstrated in detail. The advantages and disadvantages of this approach will be discussed with specific reference to an electronic news clipping service application.
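The abstract mentions retrieval that tolerates misspellings, character transpositions and OCR noise. The record does not describe how TOPIC's TYPO operator works internally, so the following is only a minimal sketch of one common technique for this kind of matching: the optimal string alignment distance (a restricted form of Damerau-Levenshtein). The function names `osa_distance` and `typo_match` and the `max_edits` threshold are assumptions introduced for illustration, not part of TOPIC.

```python
from typing import Iterable, List


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Insertions, deletions, substitutions and transpositions of two
    adjacent characters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "ie" read as "ei".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def typo_match(query: str, vocabulary: Iterable[str],
               max_edits: int = 1) -> List[str]:
    """Return the vocabulary words within max_edits of the query.

    Hypothetical helper: a linear scan, suitable only for small word lists.
    """
    q = query.lower()
    return [w for w in vocabulary if osa_distance(q, w.lower()) <= max_edits]
```

For example, `typo_match("document", ["docurnent", "doeument", "image"], max_edits=2)` keeps the two OCR-style variants and drops the unrelated word. A production system would likely use an index rather than scanning every term.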

Url:
DOI: 10.1108/eb045245


Affiliations:



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The integration of document image processing and text retrieval principles</title>
<author wicri:is="90%">
<name sortKey="Van Der Merwe, Nil" sort="Van Der Merwe, Nil" uniqKey="Van Der Merwe N" first="Nil" last="Van Der Merwe">Nil Van Der Merwe</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:1B78DFC63722E09B83AD64D444F2D260FE360F24</idno>
<date when="1993" year="1993">1993</date>
<idno type="doi">10.1108/eb045245</idno>
<idno type="url">https://api.istex.fr/document/1B78DFC63722E09B83AD64D444F2D260FE360F24/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000062</idno>
<idno type="wicri:Area/Istex/Curation">000061</idno>
<idno type="wicri:Area/Istex/Checkpoint">002321</idno>
<idno type="wicri:doubleKey">0264-0473:1993:Van Der Merwe N:the:integration:of</idno>
<idno type="wicri:Area/Main/Merge">003232</idno>
<idno type="wicri:Area/Main/Curation">003061</idno>
<idno type="wicri:Area/Main/Exploration">003061</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">The integration of document image processing and text retrieval principles</title>
<author wicri:is="90%">
<name sortKey="Van Der Merwe, Nil" sort="Van Der Merwe, Nil" uniqKey="Van Der Merwe N" first="Nil" last="Van Der Merwe">Nil Van Der Merwe</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Afrique du Sud</country>
<wicri:regionArea>Xcel, PO Box 20355, Alkantrant 0005</wicri:regionArea>
<wicri:noRegion>Alkantrant 0005</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">The Electronic Library</title>
<idno type="ISSN">0264-0473</idno>
<imprint>
<publisher>MCB UP Ltd</publisher>
<date type="published" when="1993-04-01">1993-04-01</date>
<biblScope unit="volume">11</biblScope>
<biblScope unit="issue">4/5</biblScope>
<biblScope unit="page" from="273">273</biblScope>
<biblScope unit="page" to="278">278</biblScope>
</imprint>
<idno type="ISSN">0264-0473</idno>
</series>
<idno type="istex">1B78DFC63722E09B83AD64D444F2D260FE360F24</idno>
<idno type="DOI">10.1108/eb045245</idno>
<idno type="filenameID">2630110410</idno>
<idno type="original-pdf">2630110410.pdf</idno>
<idno type="href">eb045245.pdf</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0264-0473</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper will discuss the integration of document image processing and text retrieval principles in order to process and load existing paper documents automatically in an electronic document database that broadens the user's capability to retrieve relevant information more accurately, without going through costly processes to get paper documents into electronic text. The principles of document image processing systems, as well as the problems and shortcomings of most of today's document image processing systems, will be discussed. Then concept retrieval as the latest development in text retrieval will be discussed, with specific reference to the ability of the TOPIC intelligent text retrieval system to allow users to build up a knowledge base of search objects or concepts that can be used at any point in time by all users for the system. This paper will further specifically look at the automatic processing of paper documents by converting the scanned document image pages through to electronic text. The use of optical character recognition technology, the indexing and loading of the documents in a text database, the automatic linking of the documents to the related document images and the retrieval technology available in TOPIC, specifically the TYPO operator that was developed to handle so-called dirty data such as the common misspellings, character transpositions and dirty text received as output from the OCR process, will be discussed. A possible solution to load paper documents quickly and cost-effectively into an electronic document database will be discussed and demonstrated in detail. The advantages and disadvantages of this approach will be discussed with specific reference to an electronic news clipping service application.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Afrique du Sud</li>
</country>
</list>
<tree>
<country name="Afrique du Sud">
<noRegion>
<name sortKey="Van Der Merwe, Nil" sort="Van Der Merwe, Nil" uniqKey="Van Der Merwe N" first="Nil" last="Van Der Merwe">Nil Van Der Merwe</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>
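The record above uses the `wicri:` prefix without declaring it, so a namespace-aware XML parser will typically reject it as written. The sketch below shows one possible way to read such a record with Python's standard library: bind the prefix on the root element before parsing, then pull out a few bibliographic fields. The namespace URI `urn:example:wicri`, the abbreviated `SAMPLE` string, and the helper names `parse_record` and `summarize` are assumptions made for illustration; they are not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Placeholder URI (assumption): the record itself declares no namespace for wicri.
WICRI_NS = "urn:example:wicri"

# Abbreviated copy of the record shown above, kept short for the example.
SAMPLE = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader><fileDesc>
<publicationStmt>
<date when="1993" year="1993">1993</date>
</publicationStmt>
<sourceDesc><biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">The integration of document image processing and text retrieval principles</title>
<author wicri:is="90%">
<name sortKey="Van Der Merwe, Nil">Nil Van Der Merwe</name>
</author>
</analytic>
<series>
<title level="j">The Electronic Library</title>
</series>
<idno type="DOI">10.1108/eb045245</idno>
</biblStruct></sourceDesc>
</fileDesc></teiHeader>
</TEI>
</record>"""


def parse_record(xml_text: str) -> ET.Element:
    """Parse a record after binding the undeclared wicri: prefix on the root."""
    patched = xml_text.replace(
        "<record>", f'<record xmlns:wicri="{WICRI_NS}">', 1
    )
    return ET.fromstring(patched)


def summarize(record: ET.Element) -> dict:
    """Collect a few bibliographic fields from the parsed record."""
    analytic = record.find(".//analytic")
    series = record.find(".//series")
    return {
        "title": analytic.findtext("title"),
        "author": analytic.findtext("author/name"),
        "journal": series.findtext("title"),
        "doi": record.findtext(".//biblStruct/idno[@type='DOI']"),
        "year": record.find(".//publicationStmt/date").get("year"),
    }


if __name__ == "__main__":
    for key, value in summarize(parse_record(SAMPLE)).items():
        print(f"{key}: {value}")
```

Patching the root tag as a string is a deliberately simple choice; it assumes the record starts with a bare `<record>` element, as it does here.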

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 003061 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 003061 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:1B78DFC63722E09B83AD64D444F2D260FE360F24
   |texte=   The integration of document image processing and text retrieval principles
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024